Missing Data

Row / column missing patterns

Sample dataset with missing values

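library(tidyverse)

# Build a copy of mtcars with missing values added
set.seed(5702)            # make the random NAs reproducible
mycars <- mtcars
mycars[, "gear"] <- NA    # one variable entirely missing
mycars[10:20, 3:5] <- NA  # a block of rows missing disp, hp, and drat
# 10 more NAs in randomly chosen cells (the same cell may be picked twice)
for (i in 1:10) mycars[sample(32, 1), sample(11, 1)] <- NA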

Missing values by column

colSums(is.na(mycars)) %>%
  sort(decreasing = TRUE)
## gear disp   hp drat   am  cyl qsec   vs  mpg   wt carb 
##   32   12   12   11    3    1    1    1    0    0    0

Missing values by row

rowSums(is.na(mycars)) %>%
  sort(decreasing = TRUE)
##          Merc 450SE         Merc 450SLC Lincoln Continental            Fiat 128 
##                   5                   5                   5                   5 
##            Merc 280           Merc 280C          Merc 450SL  Cadillac Fleetwood 
##                   4                   4                   4                   4 
##   Chrysler Imperial         Honda Civic      Toyota Corolla        Lotus Europa 
##                   4                   4                   4                   3 
##         AMC Javelin           Fiat X1-9           Mazda RX4       Mazda RX4 Wag 
##                   2                   2                   1                   1 
##          Datsun 710      Hornet 4 Drive   Hornet Sportabout             Valiant 
##                   1                   1                   1                   1 
##          Duster 360           Merc 240D            Merc 230       Toyota Corona 
##                   1                   1                   1                   1 
##    Dodge Challenger          Camaro Z28    Pontiac Firebird       Porsche 914-2 
##                   1                   1                   1                   1 
##      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
##                   1                   1                   1                   1

heatmap

geom_tile()
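
The plot from this slide is not reproduced here. A minimal sketch of one way to draw such a heatmap with geom_tile(), assuming the mycars data from the sample-dataset slide. The names tidycars, id, key, value, and missing are illustrative choices, not taken from the original.

```r
library(tidyverse)

# Recreate the sample data so this block runs on its own
set.seed(5702)
mycars <- mtcars
mycars[, "gear"] <- NA
mycars[10:20, 3:5] <- NA
for (i in 1:10) mycars[sample(32, 1), sample(11, 1)] <- NA

# Long format: one row per (car, variable) cell
tidycars <- mycars %>%
  rownames_to_column("id") %>%
  pivot_longer(-id, names_to = "key", values_to = "value") %>%
  mutate(missing = ifelse(is.na(value), "yes", "no"))

# One tile per cell, colored by whether the value is missing
p <- ggplot(tidycars, aes(x = key, y = fct_rev(id), fill = missing)) +
  geom_tile(color = "white") +
  labs(x = NULL, y = NULL, title = "mtcars with NAs added") +
  theme_bw()
p
```

Reversing the row factor with fct_rev() keeps the cars in alphabetical order from top to bottom, since ggplot2 places the first factor level at the bottom of the y-axis.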

heatmap

mi::missing_data.frame()

library(mi)
x <- missing_data.frame(mycars)
image(x)

(gear not shown since all are missing)

Missing values by variable

geom_tile()

Missing values by variable

geom_tile() with standardized variables
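
The figure for this slide is not available. As a rough sketch, the non-missing cells can be filled by their standardized value (z-score within each variable) while missing cells get a separate color. The column name std and the color choices are assumptions made for illustration.

```r
library(tidyverse)

# Recreate the sample data so this block runs on its own
set.seed(5702)
mycars <- mtcars
mycars[, "gear"] <- NA
mycars[10:20, 3:5] <- NA
for (i in 1:10) mycars[sample(32, 1), sample(11, 1)] <- NA

# Standardize each variable using only its observed values
tidycars <- mycars %>%
  rownames_to_column("id") %>%
  pivot_longer(-id, names_to = "key", values_to = "value") %>%
  group_by(key) %>%
  mutate(std = (value - mean(value, na.rm = TRUE)) / sd(value, na.rm = TRUE)) %>%
  ungroup()

# Observed cells use a diverging scale; missing cells are drawn in black
p <- ggplot(tidycars, aes(x = key, y = fct_rev(id), fill = std)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       na.value = "black") +
  labs(x = NULL, y = NULL) +
  theme_bw()
p
```

Standardizing puts all variables on a comparable scale, so a single fill legend works across columns as different as hp and wt.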

Missing values by variable

reordered by number of missing
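
One possible way to get this ordering, sketched with forcats::fct_reorder(): variables are sorted from most to fewest missing values (left to right), and cars are sorted so that rows with the most missing values appear at the top. The helper column missing is an illustrative name, and the original slide may have used a different approach.

```r
library(tidyverse)

# Recreate the sample data so this block runs on its own
set.seed(5702)
mycars <- mtcars
mycars[, "gear"] <- NA
mycars[10:20, 3:5] <- NA
for (i in 1:10) mycars[sample(32, 1), sample(11, 1)] <- NA

tidycars <- mycars %>%
  rownames_to_column("id") %>%
  pivot_longer(-id, names_to = "key", values_to = "value") %>%
  mutate(missing = is.na(value)) %>%
  mutate(
    # columns: most missing first
    key = fct_reorder(key, missing, .fun = sum, .desc = TRUE),
    # rows: ascending levels, so the most-missing rows land at the top of the y-axis
    id = fct_reorder(id, missing, .fun = sum)
  )

p <- ggplot(tidycars, aes(x = key, y = id, fill = missing)) +
  geom_tile(color = "white") +
  labs(x = NULL, y = NULL) +
  theme_bw()
p
```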

Missing values by variable

missing patterns

x <- mi::missing_data.frame(mycars)
## NOTE: In the following pairs of variables, the missingness pattern of the second is a subset of the first.
##  Please verify whether they are in fact logically distinct variables.
##      [,1]   [,2]  
## [1,] "disp" "drat"
## [2,] "disp" "qsec"
## [3,] "hp"   "drat"
## [4,] "hp"   "qsec"
## [5,] "hp"   "am"  
## [6,] "drat" "qsec"
class(x)
## [1] "missing_data.frame"
## attr(,"package")
## [1] "mi"
x@patterns
##  [1] nothing              nothing              nothing             
##  [4] nothing              nothing              nothing             
##  [7] nothing              nothing              nothing             
## [10] disp, hp, drat       disp, hp, drat       disp, hp, drat, am  
## [13] disp, hp, drat       disp, hp, drat, qsec disp, hp, drat      
## [16] disp, hp, drat, am   disp, hp, drat       cyl, disp, hp, drat 
## [19] disp, hp, drat       disp, hp, drat       nothing             
## [22] nothing              vs                   nothing             
## [25] nothing              disp                 nothing             
## [28] hp, am               nothing              nothing             
## [31] nothing              nothing             
## 8 Levels: nothing vs disp hp, am disp, hp, drat ... cyl, disp, hp, drat
levels(x@patterns)
## [1] "nothing"              "vs"                   "disp"                
## [4] "hp, am"               "disp, hp, drat"       "disp, hp, drat, am"  
## [7] "disp, hp, drat, qsec" "cyl, disp, hp, drat"
summary(x@patterns)
##              nothing                   vs                 disp 
##                   18                    1                    1 
##               hp, am       disp, hp, drat   disp, hp, drat, am 
##                    1                    7                    2 
## disp, hp, drat, qsec  cyl, disp, hp, drat 
##                    1                    1

Aggregated missing patterns

(repeated patterns are reduced to one row)

Sorted by most common to least common missing pattern (top to bottom)

Sorted by variable with the most to least missing values (left to right)
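
The aggregation described above can be sketched with dplyr: convert each cell to a missing/not-missing flag, collapse identical rows into one pattern with a count, then sort the rows and columns. This is an illustration under stated assumptions rather than the code behind the original figures. The names patterns, count, and var_order are hypothetical, and gear is dropped here because it is missing in every row.

```r
library(tidyverse)

# Recreate the sample data so this block runs on its own
set.seed(5702)
mycars <- mtcars
mycars[, "gear"] <- NA
mycars[10:20, 3:5] <- NA
for (i in 1:10) mycars[sample(32, 1), sample(11, 1)] <- NA

# TRUE/FALSE missingness flags, without the all-missing gear column
flags <- mycars %>%
  select(-gear) %>%
  mutate(across(everything(), is.na))

# Collapse repeated patterns to one row each, most common pattern first
patterns <- flags %>%
  group_by(across(everything())) %>%
  summarise(count = n(), .groups = "drop") %>%
  arrange(desc(count))

# Order the columns from most to fewest missing values
var_order <- names(sort(colSums(flags), decreasing = TRUE))
patterns <- patterns %>% select(all_of(var_order), count)
patterns
```

Each row of patterns is one distinct combination of missing variables, and count says how many cars share it.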

NYC School data

DBN     School Name                                    Number of Test Takers  Critical Reading Mean  Mathematics Mean  Writing Mean
01M292  Henry Street School for International Studies  31                     391                    425               385
01M448  University Neighborhood High School            60                     394                    419               387
01M450  East Side Community High School                69                     418                    431               402
01M458  SATELLITE ACADEMY FORSYTH ST                   26                     385                    370               378
01M509  CMSP HIGH SCHOOL                               NA                     NA                     NA                NA
01M515  Lower East Side Preparatory High School        154                    314                    532               314
## [1] 460   6

NYC School data

Value patterns in missing data

NYC School data

Does the proportion of schools with missing data vary by borough?

Data: SAT2010.csv

Missing by borough

Borough  num_schools  num_na  percent_na
K        141          32      0.23
Q        73           14      0.19
M        111          13      0.12
X        124          14      0.11
R        11           1       0.09
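
A summary like the one above could be computed roughly as follows. This sketch uses only the six rows shown earlier rather than the full SAT2010.csv file, and it rests on two assumptions: that the borough code is the third character of the DBN, and that a school counts as missing when its number of test takers is NA. The column names school and n_takers are shortened, hypothetical names.

```r
library(tidyverse)

# The six example rows from the slide (the full file has 460 rows)
sat <- tribble(
  ~DBN,     ~school,                                         ~n_takers,
  "01M292", "Henry Street School for International Studies", 31,
  "01M448", "University Neighborhood High School",           60,
  "01M450", "East Side Community High School",               69,
  "01M458", "SATELLITE ACADEMY FORSYTH ST",                  26,
  "01M509", "CMSP HIGH SCHOOL",                              NA,
  "01M515", "Lower East Side Preparatory High School",       154
)

by_borough <- sat %>%
  mutate(Borough = str_sub(DBN, 3, 3)) %>%   # assumed: 3rd character is the borough code
  group_by(Borough) %>%
  summarise(
    num_schools = n(),
    num_na      = sum(is.na(n_takers)),
    percent_na  = round(num_na / num_schools, 2)
  ) %>%
  arrange(desc(percent_na))
by_borough
```

Note that percent_na, as in the table, is a proportion between 0 and 1 rather than a percentage.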

K, Q, M, X, R

Manhattan = New York County = “M”

Brooklyn = Kings County = “K”

Queens = Queens County = “Q”

The Bronx = Bronx County = “X”

Staten Island = Richmond County = “R”

Borough  num_schools  num_na  percent_na  BoroughName
K        141          32      0.23        Brooklyn
Q        73           14      0.19        Queens
M        111          13      0.12        Manhattan
X        124          14      0.11        The Bronx
R        11           1       0.09        Staten Island
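
Adding the BoroughName column amounts to a lookup from the one-letter code to the borough name. A small sketch using a named vector, with the summary values typed in from the table above; the object names borough_names and by_borough are illustrative.

```r
library(tidyverse)

# Lookup from borough code to borough name
borough_names <- c(
  M = "Manhattan",
  K = "Brooklyn",
  Q = "Queens",
  X = "The Bronx",
  R = "Staten Island"
)

# Summary values copied from the table on the slide
by_borough <- tribble(
  ~Borough, ~num_schools, ~num_na, ~percent_na,
  "K",      141,          32,      0.23,
  "Q",      73,           14,      0.19,
  "M",      111,          13,      0.12,
  "X",      124,          14,      0.11,
  "R",      11,           1,       0.09
)

# Index the named vector by code; unname() drops the codes from the result
by_borough <- by_borough %>%
  mutate(BoroughName = unname(borough_names[Borough]))
by_borough
```

Readable names matter most once the boroughs appear as axis or facet labels, where single letters such as "X" and "R" are hard to interpret.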

Missing by borough and scores

EXAMPLE: Snowfall

I need to know how much snow fell per day in New York State in February 2017, on a county or more detailed level. I know some government agency measures and reports snowfall and puts the data online.

Source: https://www.ncdc.noaa.gov/snow-and-ice/daily-snow/NY/snowfall/20170201

Accessed: 2017-10-26

Number missing by county mean

Number missing by day of month

Number of missing values per day by total snowfall

no gridlines